About the project

Welcome

to my humble landing page and to the Introduction to Open Data Science course, which I'm taking together with the others!
Data science is growing in importance, and on this course we will learn related practical skills and, of course, have fun doing it!

Here is a link to my GitHub repository


Analysis part of the R-exercise 2

This week I’ve worked with wrangling, tidying and interpreting questionnaire data, and learned how to plot and test linear regression models. Of the variables included in the study, attitude seems to be the main factor affecting the test scores. The week has been a good refresher of previous statistics courses, and I have learned more about using R Markdown for presenting results from R.

Introduction to the data learning2014

The data in question is a survey of students’ attitudes towards statistics (SATS) and their study skills (ASSIST), gathered 3.12.2014 - 10.1.2015. Students’ approaches to studying were divided into deep, strategic and surface categories. Here are the descriptions from a presentation by Kimmo Vehkalahti:

Surface approach: memorizing without understanding, with a serious lack of personal engagement in the learning process.
Deep approach: intention to maximize understanding, with a true commitment to learning.
Strategic approach: the ways students organize their studying (apply any strategy that maximizes the chance of achieving the highest possible grades).

The points variable gives the points received in a statistics exam. Participants with 0 points (meaning they did not attend the exam) were filtered out of the data.
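That filtering step can be sketched in base R like this (a toy data frame with hypothetical values; the real wrangling was done in the data-wrangling script):

```r
# toy data standing in for the raw survey results (hypothetical values)
learning2014_raw <- data.frame(
  gender = c("F", "M", "F", "M"),
  points = c(25, 0, 19, 0)
)

# keep only participants who attended the exam (points > 0)
learning2014_kept <- subset(learning2014_raw, points > 0)

nrow(learning2014_kept)  # 2 rows remain
```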

Summary of the data

learning2014 <- read.csv("https://raw.githubusercontent.com/BreezewindX/IODS-project/master/data/learning2014.csv")
summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

The data consists of seven variables and the survey responses of 166 persons, of whom only 56 were men (perhaps a somewhat small sample). There were approximately twice as many female participants as male ones. The ages of the participants ranged from 17 to 55 years.

Scatterplot matrix between the variables

library(GGally)
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

From the matrix we can see that the variables correlating most positively with exam points are attitude (0.437) and stra (0.146), while surf (-0.144) is negatively correlated. Correlation measures how close two variables are to having a linear relationship with each other.

There is not much correlation between the study approaches; the largest is -0.324 between ‘deep’ and ‘surf’ (notable, but logical, as a student relying on the surface approach is unlikely to also use the deep approach).
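As a reminder of what these numbers are, the Pearson correlation is just the covariance scaled by the two standard deviations; a small base-R check on toy values:

```r
# Pearson correlation: covariance divided by the product of the standard deviations
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

isTRUE(all.equal(r_manual, r_builtin))  # TRUE
```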

The distribution of age is heavily skewed to the right (a long tail towards higher ages), which means that a large portion of the participants were relatively young. The median age for men was a bit higher than for women.

The mean attitude score for men is somewhat higher than for women, and in general not many men scored low in attitude. There seems to be little difference between the genders in the use of the deep approach, while women seem to rely more on the strategic approach.

We’ll choose the following linear model for regression \[points_i = \beta_1+\beta_2attitude_i+\beta_3stra_i+\beta_4surf_i+\epsilon_i\] where \(\beta_1\) is the intercept and \(\epsilon_i\) is the error term.

# create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

According to the multiple R-squared value, the predictors jointly explain only 0.21 (21%) of the observed variance in the dependent variable. The F-statistic of the regression tells us that we can reject the null hypothesis that the coefficients of all our explanatory variables are jointly zero, so we don’t have to abandon the model at this point.

The positive coefficient of 3.40 for the variable attitude is statistically significant even at the 0.001 significance level.

The p-values for the variables stra and surf are higher than 0.05, which means we cannot, at the 95% confidence level, reject the null hypothesis that their coefficients are zero (ceteris paribus). Thus they are candidates for discarding.

Let’s test the joint null hypothesis that the coefficients of stra and surf are both zero, \(H_0:\beta_3=\beta_4=0\).

library(car)
ss_results <- linearHypothesis(my_model, c("stra = 0", "surf = 0"))
print(ss_results)
## Linear hypothesis test
## 
## Hypothesis:
## stra = 0
## surf = 0
## 
## Model 1: restricted model
## Model 2: points ~ attitude + stra + surf
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    164 4641.1                           
## 2    162 4544.4  2    96.743 1.7244 0.1815

We get a p-value of 0.182 (> 0.05), which means we cannot reject the joint null hypothesis at the 95% confidence level.

Let’s now drop these two ‘redundant’ variables. The new regression model is \[points_i = \beta_1 +\beta_2attitude_i+\epsilon_i\]

# create a regression model without redundant variables
new_model <- lm(points ~ attitude, data = learning2014)

# print out a summary of the model
summary(new_model)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Now the multiple \(R^2\) is almost the same as earlier, even though we dropped two variables. This is reassuring, since adding variables tends to make \(R^2\) grow even when they contribute little; here the adjusted \(R^2\) decreased only slightly. The single explanatory variable explains ~19% of the variance in the exam points.
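The adjusted \(R^2\) shown in the summaries applies the usual penalty for the number of predictors; a small sketch reproducing the reported values from the formula, using the figures from the two summaries above:

```r
# adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# three-predictor model: R^2 = 0.2074, n = 166, p = 3
adj_r2(0.2074, 166, 3)  # ~0.1927, matching the summary

# one-predictor model: R^2 = 0.1906, n = 166, p = 1
adj_r2(0.1906, 166, 1)  # ~0.186
```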

Evaluating the assumptions of the model

The assumptions made:
1. the errors are normally distributed
2. the errors are not correlated
3. the errors have constant variance (homoscedasticity)

\[\epsilon\sim N(0,\sigma^2) \]

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model, which=c(1,2,5))

Assessing assumptions visually

Looking at the Q-Q plot, the assumption of normality of the residuals seems reasonable: most of the points lie on the line, even if some outliers can be seen at the edges.

No obvious pattern emerges in the plot of residuals vs. fitted values (the variance seems constant), and none of the observations has large leverage over the whole regression.

Visual inspection implies that these assumptions are met. It is, however, a bit hard to say for sure that the variance is constant, so a few tests are in order to confirm this.

Testing if variance is a constant

We can test for heteroscedasticity using the Breusch-Pagan test, testing against the \(H_0\) of no heteroscedasticity.

library(lmtest)
library(dplyr)
new_model %>% 
  lmtest::bptest() -> bpresults
print(bpresults)
## 
##  studentized Breusch-Pagan test
## 
## data:  .
## BP = 0.0039163, df = 1, p-value = 0.9501

We get a p-value of 0.95, so we cannot reject the null hypothesis - we’ve found no evidence against homoscedasticity.

The White test is a generalization of the Breusch-Pagan test and may detect more general forms of heteroscedasticity.

whiteresults <- bptest(new_model, ~ attitude + I(attitude^2), data = learning2014)
print(whiteresults)
## 
##  studentized Breusch-Pagan test
## 
## data:  new_model
## BP = 0.088873, df = 2, p-value = 0.9565

With the White test we get a p-value of 0.957, which likewise gives no evidence against the homoscedasticity assumption of our linear regression model.


Analysis part of the R-exercise 3

1 & 2 Preparing the data

The data are from two identical questionnaires related to secondary school students’ alcohol consumption in Portugal. The attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). You can find more information here.

The following adjustments have been made:
1. The variables not used for joining the two data have been combined by averaging (including the grade variables)
2. alc_use is the average of Dalc and Walc
3. high_use is TRUE if alc_use is higher than 2 and FALSE otherwise
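Steps 2 and 3 can be sketched in base R on toy values (the real join and averaging were done in the data-wrangling script):

```r
# toy weekday/weekend consumption columns (hypothetical values, scale 1-5)
alc_toy <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 4, 5))

# alc_use is the average of weekday and weekend alcohol use
alc_toy$alc_use <- (alc_toy$Dalc + alc_toy$Walc) / 2

# high_use flags averages above 2
alc_toy$high_use <- alc_toy$alc_use > 2

alc_toy$alc_use   # 1 3 5
alc_toy$high_use  # FALSE TRUE TRUE
```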

Let’s take a look at the names of the variables in our dataset:

alc <- read.csv("https://raw.githubusercontent.com/BreezewindX/IODS-project/master/data/alc.csv")

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

3 Choosing the variables

I’ll choose high_use as dependent variable and failures, absences and sex as explanatory variables.

My intuition is that
* high alcohol use is likely related to absences and also to failures
* drinking might lead to failing, and failing might induce even more drinking
* absences might be an indicator of failures by themselves, even without high alcohol consumption, as there can be multiple reasons for absences
* alcohol consumption habits might differ between men and women.

The logistic regression model is \[logit(p_i)=\log\left(\frac{p_i}{1-p_i}\right)=\beta_1+\beta_2failures_i+\beta_3absences_i+\beta_4sex_i\] where \(p_i\) is the probability of high_use.
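The model works on the log-odds scale, so predicted probabilities are recovered with the inverse logit. A minimal base-R sketch (toy values, not fitted coefficients):

```r
# inverse logit maps log-odds back to a probability
inv_logit <- function(eta) 1 / (1 + exp(-eta))

inv_logit(0)  # log-odds of 0 correspond to probability 0.5

# R's built-in plogis() implements the same function
isTRUE(all.equal(inv_logit(1.2), plogis(1.2)))  # TRUE
```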

4 Plotting the data

library(ggplot2)
ggplot(alc, aes(x = high_use, fill = sex)) + 
  geom_bar(position = "fill")

The majority of the students with high alcohol consumption were men.

library(ggpubr)
library(dplyr)
dens_score <- ggplot(alc, aes(x = G3)) +
  geom_density()
x <- seq(1, 18, length.out = 100)
df <- with(alc, data.frame(x = x, y = dnorm(x, mean(G3), sd(G3))))
dens_score2 <- dens_score + geom_line(data = df, aes(x = x, y = y), color = "red")

dens_fail <- ggplot(alc, aes(x = failures, fill = high_use)) +
  geom_bar(position = "fill")

dens_abs <- ggplot(alc, aes(x = absences, fill = high_use)) +
  geom_bar(position = "fill")

dens2_abs <- ggplot(alc, aes(x = absences, fill = high_use)) +
  geom_histogram()

#  facet_wrap(~ high_use)
ggarrange(dens_score2, dens_fail, dens_abs, dens2_abs, 
          labels = c("A", "B", "C", "D"),
          ncol = 2, nrow = 2)

From A we can see that the scores are roughly, even if not perfectly, normally distributed (the red curve is a normal density with the same mean and standard deviation).

B shows the number of failures as a ratio between students with high and low alcohol consumption. We can see that the share of students with high alcohol use grows as the number of failures grows. This supports the hypothesis that high alcohol consumption increases the risk of failure.

Also, in C the share of students with high alcohol consumption grows as the number of absences grows. This supports the view that students with high alcohol consumption have more absences than students with low alcohol consumption.

D is graph C in absolute counts, where we can see that the number of absences tails off towards higher values. There are some outliers in the data.

library(ggplot2)

# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))

# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade")

The distribution of grades seems more symmetric among students with low alcohol consumption, and the grades of both sexes have more spread there than in the case of high consumption. Men’s grades are clearly more affected by heavy consumption than women’s, as the mean grade for women is almost the same in both cases. For men, there are outliers who received bad grades regardless of how much they drink.

# initialise a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")

Absences seem to increase when alcohol consumption is high, which makes sense, as high consumption has a tendency to make you feel like crap the next morning. Again, we can see that men’s absences are more affected by drinking. Women have more absences than men when consuming little, and the outliers in the women’s data have a pretty high number of absences.

library(GGally)
vars <- c("high_use","failures","absences","sex")
ggpairs(alc, columns = vars, mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist")))

Last but not least, we can see that there is not much correlation between failures and absences. There were about the same number of both sexes in the data, even if there were slightly more women.

5 Interpreting the logistic regression

# find the model with glm()
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1855  -0.8371  -0.6000   1.1020   2.0209  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.90297    0.22626  -8.411  < 2e-16 ***
## failures     0.45082    0.18992   2.374 0.017611 *  
## absences     0.09322    0.02295   4.063 4.85e-05 ***
## sexM         0.94117    0.24200   3.889 0.000101 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 424.40  on 378  degrees of freedom
## AIC: 432.4
## 
## Number of Fisher Scoring iterations: 4
# round the coefficients for a cleaner look
clean <- round(coef(m), digits=3)
print(clean)
## (Intercept)    failures    absences        sexM 
##      -1.903       0.451       0.093       0.941
mean(m$residuals)
## [1] 0.02143222

Here we can see the summary of the model. All explanatory variables are statistically significant at the 95% confidence level; absences and sex are significant even at 99.9%.

For every one-unit change in failures, the log-odds of high_use (versus low use) increases by 0.451.
For every one-unit change in absences, the log-odds of high_use (versus low use) increases by 0.093.
If the person in question is male, the log-odds of high_use increases by 0.941.

Therefore this fitted model says that, holding failures and absences at a fixed value, the odds of high alcohol consumption for a male (sexM = 1) over the odds for a female (sexM = 0) are exp(0.941166) ≈ 2.563. In terms of percent change, the odds for men are about 156% higher than the odds for women. The coefficient for failures says that, holding absences and sex at a fixed value, the odds of high alcohol consumption increase by about 57% for a one-unit increase in failures, since exp(0.4508198) ≈ 1.57. The variable absences has the same interpretation: exp(0.093) ≈ 1.098, i.e. roughly a 10% increase in the odds per additional absence.

Residuals of the regression are close to mean zero, which is what we would want.

# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp

# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %   97.5 %
## (Intercept) 0.1491257 0.09395441 0.228611
## failures    1.5695984 1.08339644 2.294737
## absences    1.0977032 1.05169654 1.150848
## sexM        2.5629682 1.60381392 4.149405

Here we can see the odds ratios for the variables and their corresponding confidence intervals. Note that for logistic models, confidence intervals are based on the profiled log-likelihood function.

Following is an excerpt from here.

An odds ratio measures the association between a predictor variable (x) and the outcome variable (y). It represents the ratio of the odds that an event will occur (event = 1) given the presence of the predictor x (x = 1), compared to the odds of the event occurring in the absence of that predictor (x = 0). For a given predictor (say x1), the associated beta coefficient (b1) in the logistic regression function corresponds to the log of the odds ratio for that predictor. If the odds ratio is 2, then the odds that the event occurs (event = 1) are two times higher when the predictor x is present (x = 1) versus x is absent (x = 0).

From the odds ratios we have here, we can say that high alcohol consumption is more likely when failures are present, and if the student is male the odds are over 2.5 times as high as for a female.
Note that the odds ratio is the exponential of the regression coefficient, not its square.

The confidence interval can be interpreted so that, over repeated samples, intervals constructed this way would contain the true value of the coefficient 95% of the time. The confint function shows two-tailed intervals, from the 2.5% point to the 97.5% point of the relevant distribution, which form the lower and upper limits.

Let’s recap my intuition from the beginning of this analysis part:

  • a high use of alcohol is likely related to absences and also failures.

It seems that failures and high use of alcohol are related; we can see this from the statistically significant coefficient in the regression.

  • absences might be indicator of failures by itself, even without high alcohol consumption, as there can be multiple reasons for these absences.

Absences are also statistically significant, but their presence does not raise the odds that much. There are probably many other reasons for being absent than alcohol intake.

  • alcohol consumption habits might differ between men and women.

The data analysis seems to support the hypothesis that men are more likely to have high alcohol consumption in the study group, and that with high consumption they also get lower grades and are absent more often.

6 Exploring the predictive power of the chosen model

Target variable versus the predictions (2x2) tabulation

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   259    9
##    TRUE     84   30

We can see that the model predicted quite accurately when high_use was false, as only 9 predictions out of 268 were wrong.

However, it did not do so well when high_use was true, as the prediction was wrong 84 times out of 114. A graphical representation follows.

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67801047 0.02356021 0.70157068
##    TRUE  0.21989529 0.07853403 0.29842932
##    Sum   0.89790576 0.10209424 1.00000000

Computing the total proportion of inaccurately classified individuals (= the training error):

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss <- loss_func(class = alc$high_use, prob = alc$probability)
print(loss)
## [1] 0.2434555

On average, our model classifies approximately 24% of the observations in the training data incorrectly.
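The training error lumps together two rather different kinds of mistakes; from the counts in the 2x2 table above we can also compute accuracy, sensitivity and specificity (a quick base-R sketch using those counts):

```r
# confusion-matrix counts from the table above
tn <- 259; fp <- 9    # row high_use = FALSE
fn <- 84;  tp <- 30   # row high_use = TRUE

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
sensitivity <- tp / (tp + fn)                   # true high_use cases found
specificity <- tn / (tn + fp)                   # true non-high_use cases found

round(c(accuracy, sensitivity, specificity), 3)  # 0.757 0.263 0.966
```

Accuracy of 0.757 agrees with the training error of 0.243 above; the low sensitivity (0.263) confirms that the model struggles exactly with the high_use = TRUE cases.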

7 10-fold cross-validation on the model

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2591623

The average proportion of wrong predictions in the cross-validation is approximately 0.26. My chosen model is exactly the same as in the DataCamp exercise, so of course the prediction error is the same. When comparing different models, we would prefer the model with the smaller penalty (i.e. error). If one could re-select the explanatory variables so that the error were smaller, that model could be preferred for prediction over this one. Note that adding more variables to the regression does not necessarily reduce the cross-validation error; it may even make it grow.


Analysis part of the R-exercise 4

1 Loading the inbuilt dataset Boston

library(MASS)
# load the data
data("Boston")

2 A brief overview of the data

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Boston dataframe has 506 observations of 14 variables:

  • crim - per capita crime rate by town.
  • zn - proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus - proportion of non-retail business acres per town.
  • chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox - nitrogen oxides concentration (parts per 10 million).
  • rm - average number of rooms per dwelling.
  • age - proportion of owner-occupied units built prior to 1940.
  • dis - weighted mean of distances to five Boston employment centres.
  • rad - index of accessibility to radial highways.
  • tax - full-value property-tax rate per $10,000.
  • ptratio - pupil-teacher ratio by town.
  • black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat - lower status of the population (percent).
  • medv - median value of owner-occupied homes in $1000s.

The information about the dataset can be found here.

3 Plotting the data

# plot matrix of the variables
pairs(Boston)

The pairs plot is a mess so far, so let’s see if we can discard some variables and then maybe try drawing it again…

Data correlations

Let’s draw the correlations between the variables.

library("tidyverse")
library("dplyr")
library("corrplot")
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits=2)
# print the correlation matrix
#print(cor_matrix)
# visualize the correlation matrix
corrplot.mixed(cor_matrix, tl.cex = 0.7, number.cex = .8)

The most correlated variable pairs (with |correlation| > 0.7) are the following:

library(reshape2)
CM <- cor_matrix                           # Make a copy of the matrix
CM[lower.tri(CM, diag = TRUE)] <- NA       # lower tri and diag set to NA
subset(melt(CM, na.rm = TRUE), abs(value) > .7)
##      Var1 Var2 value
## 59  indus  nox  0.76
## 89    nox  age  0.73
## 101 indus  dis -0.71
## 103   nox  dis -0.77
## 105   age  dis -0.75
## 129 indus  tax  0.72
## 135   rad  tax  0.91
## 195 lstat medv -0.74

Variables rad and tax are highly positively correlated (0.91). This can be interpreted as properties with better access to the city’s radial highways also having a higher property-tax rate.

Next are nox and dis with a correlation of -0.77. The interpretation is that the smaller the weighted average distance to the five Boston employment centres, the larger the nitrogen oxides concentration (worse air pollution).

Third, with a positive correlation of 0.76, come indus and nox. The interpretation is that the higher the proportion of non-retail business acres in a town, the higher the concentration of nitrogen oxides.

(One thing that can be noted is that the status of the inhabitants and the number of rooms correlate strongly with the housing value.)

For convenience, let’s plot histograms only for these variables of interest.

Distributions of the chosen variables

library(ggpubr)
library(ggplot2)

rad_plot <- ggplot(Boston, aes(x = rad)) +
  geom_histogram()

tax_plot <- ggplot(Boston, aes(x = tax)) +
  geom_histogram()

nox_plot <- ggplot(Boston, aes(x = nox)) +
  geom_histogram()

dis_plot <- ggplot(Boston, aes(x = dis)) +
  geom_histogram()

indus_plot <- ggplot(Boston, aes(x = indus)) +
  geom_histogram()

ggarrange(rad_plot, tax_plot, nox_plot, dis_plot, indus_plot, 
          labels = c("A", "B", "C", "D", "E"),
          ncol = 3, nrow = 2)

Plot A: rad is the index of accessibility to radial highways. There are two concentrations in the graph: the first around index 4-5 and the second at index 24. There are 132 observations in the latter (index value 24), which is a high share of all observations. It would make sense to check a map of Boston to better understand these outliers.

Plot B: tax is the full-value property-tax rate, and its distribution is pretty similar to plot A, i.e. a concentration at the low end and a large number of outliers at the top (132 observations with value 666).

Plots C and D (nox and dis) also have distributions skewed to the right.

Plot E: indus is also skewed to the right, with a high count (132 observations) at a single value (18.1).

There seem to be 132 observations in the data that are outliers for some reason.
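As a quick check (a sketch assuming the MASS Boston data is loaded), we can count those repeated values and see whether the spikes come from the same rows:

```r
library(MASS)
data("Boston")

# counts of the repeated extreme value in each variable
sum(Boston$rad == 24)   # 132
sum(Boston$tax == 666)  # 132

# do the spikes come from the very same rows (towns)?
identical(which(Boston$rad == 24), which(Boston$tax == 666))
```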

4 Standardising the dataset

boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

The scaled dataset is of type matrix, so we’ll convert it into a data frame. In the new data frame all the variables have zero mean and unit variance.
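scale() simply subtracts each column’s mean and divides by its standard deviation; a tiny base-R check of that fact on toy values:

```r
# scale() centers and scales: (x - mean(x)) / sd(x), column-wise
x <- c(2, 4, 6, 8)
scaled <- as.vector(scale(x))

isTRUE(all.equal(scaled, (x - mean(x)) / sd(x)))  # TRUE
c(mean(scaled), sd(scaled))                       # approximately 0 and 1
```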

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

Linear discriminant analysis produces results based on the assumptions that

  • the variables are normally distributed (on condition of the classes),
  • the normal distributions for each class share the same covariance matrix, and
  • the variables are continuous

Next, we create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate), using the quantiles as the break points. We then drop the old crime rate variable from the dataset and divide the data into train and test sets, so that 80% of the data belongs to the train set.

bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low","med_low","med_high","high"))

# look at the table of the new factor crime
#table(crime)

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

5 Fitting and plotting the linear discriminant analysis on the train set

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
#lda.fit

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

From the graph we can see which variables imply high crime; the most important factor is rad, whose arrow points towards the high-crime cluster.

6 Predict the classes with the LDA model on the test data

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       21      11        0    0
##   med_low    3      16        3    0
##   med_high   1       7       15    1
##   high       0       0        0   24

The model predicts high crime very well (all 24 observations are classified correctly), and looking at the predictions as a whole, it does better on the higher crime rates. For every class, the model predicted correctly more often than incorrectly.
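From the cross-tabulation we can also compute the overall prediction accuracy. This is a small addition of mine, reusing the correct_classes and lda.pred objects from above:

# proportion of correct predictions (diagonal of the confusion matrix)
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)

With the table above this works out to 76/102, i.e. roughly 75% of the test observations are classified correctly.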

7 Standardising the original dataset and calculating distances

First we calculate the Euclidean distances, then the Manhattan distances.

library(factoextra)
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

# euclidean distance matrix
dist_eu <- get_dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

# manhattan distance matrix
dist_man <- dist(boston_scaled, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
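
To make the difference between the two metrics concrete, here is a tiny hand-checkable example of my own (not part of the exercise data):

# two points in 2D: (0,0) and (3,4)
p <- rbind(c(0, 0), c(3, 4))
dist(p)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(p, method = "manhattan")  # Manhattan: |3 - 0| + |4 - 0| = 7

The Manhattan distance is always at least as large as the Euclidean one, which also shows in the summaries above.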

Calculating the K-means

############
# K-means  #
############
k_max <- 10
set.seed(123)

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results

qplot(x = 1:k_max, y = twcss, geom = 'line')

From the graph we choose to have two centers, as it is the point where the slope of the line changes from steep to level (using the elbow method). Let’s run K-means with two centers.
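
As an alternative to the elbow method (my addition, assuming the factoextra package loaded earlier is still attached), we could inspect the average silhouette width; the k with the highest average silhouette is a good candidate:

# average silhouette width per number of clusters
set.seed(123)
fviz_nbclust(boston_scaled, kmeans, method = "silhouette", k.max = 10)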

# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)

# tidyverse for the pipe and dplyr verbs; the table below uses DT::datatable
library(tidyverse)
boston_scaled %>%
  mutate(Cluster = km$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean") %>%
  DT::datatable(extensions = 'FixedColumns',
                options = list(dom = 't', scrollX = TRUE, scrollCollapse = TRUE))

You can scroll the datatable to see the rest of the variables, which are summarised by their means grouped by cluster number. There are clear differences between the two clusters: cluster #2 has higher crime, a bigger proportion of non-retail business acres, more pollution and older buildings; it is further from the employment centres, has fewer teachers per pupil, and the median value of owner-occupied homes is lower.

Let’s draw the pairs plot coloured according to the two clusters:

# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)

Red is cluster #2 and black is the nicer-neighbourhood cluster #1.

We can see from the graph what we earlier saw numerically in the datatable. The top row plots crim against the other variables, and cluster #2 shows higher crime rates than cluster #1 across all of them.

# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[1:5], col = km$cluster)

If we compare crim and indus, we see that the red cluster has higher crime and is more industrialised than the black one. The same goes for nox, and it is logical that more industry also means more air pollution. The zn variable (the proportion of residential land zoned for large lots) shows that the red cluster has hardly any such land, unlike the black cluster.

We could go on analysing all the variables in the graph, but I’m fairly sure that is not the point of this exercise.

Bonus

data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
set.seed(123)
k3 <- kmeans(boston_scaled, centers = 3)
k4 <- kmeans(boston_scaled, centers = 4)
k5 <- kmeans(boston_scaled, centers = 5)
k6 <- kmeans(boston_scaled, centers = 6)
k7 <- kmeans(boston_scaled, centers = 7)

# plots to compare
p2 <- fviz_cluster(k3, geom = "point",  data = boston_scaled) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = boston_scaled) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = boston_scaled) + ggtitle("k = 5")
p5 <- fviz_cluster(k6, geom = "point",  data = boston_scaled) + ggtitle("k = 6")
p6 <- fviz_cluster(k7, geom = "point",  data = boston_scaled) + ggtitle("k = 7")

library(gridExtra)
grid.arrange(p2, p3, p4, p5, p6, nrow = 3)

I choose k=3 as it has the best separation in my opinion.

set.seed(123)
km <- boston_scaled %>%
  kmeans(centers = 3)
lda.fit <- lda(km$cluster ~ ., data = boston_scaled)

classes <- as.numeric(km$cluster)
plot(lda.fit, dimen = 2, col = classes)
lda.arrows(lda.fit, myscale = 4)

rad and tax (and maybe indus) are the most influential linear separators for cluster #2, age and chas for cluster #1, and dis and rm for cluster #3. The black variable pulls towards clusters #1 and #3. Selecting the right number of clusters seems to be important, so it’s useful to know some methods for choosing the optimal number, for example the elbow method we used earlier.